Partitioning Register File to Reduce Access Time
نویسندگان
چکیده
In a wide superscalar processor, the amount of time it takes to execute an application depends on the instruction latency and the amount of instruction level parallelism (ILP) that can be extracted from the application. One important factor which influences the instruction latency is the number of cycles it takes to access the register file. Whether it is a CISC architecture or a RISC architecture, practically every instruction accesses the register file for one or more operands. Therefore, its hardly surprising that processor architects have over the years designed architectures which enabled a one cycle read/write access to the register file. Single-cycle accesses are also preferred because larger access times require deeper pipelines, and deeper pipelines induce more hazards, larger branch penalties, and more complex hardware for hazard detection and data forwarding. Studies have shown that many of the ILP increasing techniques employed in wide-issue superscalar processors increase the demand for registers [3]. At the same time, an increased issue width of the processor requires additional register ports. Typically, a 4-wide superscalar processor needs to have at least 8 read ports and 4 write ports on its register file. Both of these affect the register file not only by making it much bigger but also much slower. The access time of a register file consists of two distinct components: the wire propagation delay and the fan-in/fan-out delay. Register files typically contain long word-lines and bit-lines, which can take a long time to propagate a signal across their length. For the kind of register file structures considered here, the wire propagation delay is far greater than the fan-in/fan-out delay. Bigger register file and an increased number of ports result in a taller register file layout, which translates to longer word-lines and bit-lines [7], thereby increasing wire propagation delay. Also, wire delays do not at all scale with the silicon technology improvements. Thus as register files grow in size, with faster transistors (smaller feature sizes), its only exacerbates their delay problem. Over the past decade, researchers have suggested a number of techniques for alleviating the problem of increased wire delay. Whenever a large block of silicon takes up a large fraction of the cycle time, it usually common to split the block up into smaller and more importantly faster pieces [6]. In the past, precious silicon area dictated logic reuse, but these days designers frequently duplicate logic to reduce wire lengths. We believe that these couple of ideas could be applied to the register files as well.
منابع مشابه
Partitioned Register File Designs for Clustered Architectures
The clustered architecture, where the conventional monolithic register file is partitioned into several smaller register files, is one of the candidates for the future high performance processor architectures. The aggressive partitioning can reduce the access time of the register file. On the other hand, the partitioning makes losses of instructions per clock cycle due to communication among re...
متن کاملReducing Register Pressure Through LAER Algorithm
When modern processors keep increasing the instruction window size and the issue width to exploit more instruction-level parallelism (ILP), the demand of larger physical register file is also on the increase. As a result, register file access time represents one of the critical delays and can easily become a bottleneck. In this paper, we first discuss the possibilities of reducing register pres...
متن کاملImproving Register File Banking with a Power-Aware Unroller
The complexity of the register file determines the cycle time of high performance wide-issue microprocessors due to the access time and size of this structure. Both parameters are directly related to the number of read and write ports of the register file. Therefore, it is a priority goal to reduce this complexity in order to allow the efficient implementation of complex superscalar machines. T...
متن کاملReducing Operand Transport Complexity of Superscalar Processors using Distributed Register Files
A critical problem in wide-issue superscalar processors is the limit on cycle time imposed by the central register file and operand bypass network. In this paper, a distributed register file architecture that employs fully distributed functional unit clusters is presented. It utilizes a local register mapping table and a dedicated register transfer network to support distributed register operat...
متن کاملCompiler-assisted power optimization for clustered VLIW architectures
Clustered VLIW architectures solve the scalability problem associated with flat VLIW architectures by partitioning the register file and connecting only a subset of the functional units to a register file. However, inter-cluster communication in clustered architectures leads to increased leakage in functional components and a high number of register accesses. In this paper, we propose compiler ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003